Abstract: Dealing with irrelevant emails has become part of every email user's routine. Emails that merely appear valid reach the inbox, while relevant emails are sometimes directed to spam. In addition, the very high volume of incoming email makes it difficult to pick out the messages that matter. Users waste time and effort sifting through mail in which they have no interest, and the frequent arrival of junk mail is a source of frustration. For ease of access, emails should be categorized by the type of information they contain, so that a user can identify the required messages even before opening them. This paper develops a feasible solution to this problem by identifying the real sender of an email from past email patterns and features. The proposed project applies the same solution to the problem of email categorization. A machine learning algorithm is used to detect the actual email composer. Semantic, syntactic, and lexical features of the incoming email will be considered, including n-grams, lemmatization, a personalized vocabulary, and observed writing patterns. Algorithms such as Lesk will be used to determine the meaning of a word from the context of the text, and a lexical database such as WordNet helps to find related words in the text. Machine learning will be used to learn these features and build the training data. After the framework is trained, testing data will be used to assess the system. As the system is tested, it will continue to learn from the input data so that it improves over time. Once trained, the system can operate as a fraud detection system that identifies the real sender and categorizes emails.
Keywords: Text Mining, N-gram (Unigram, Bigram, Trigram), Lemmatization, WordNet, Enron E-mail Dataset.
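As an illustration of the n-gram features named in the abstract, the following is a minimal sketch of how unigram, bigram, and trigram counts from a sender's past emails might be compared with an incoming message. The abstract does not specify the learning algorithm, so cosine similarity is used here purely as a stand-in; the function names (`tokenize`, `ngrams`, `profile`, `cosine`, `likely_sender`) and the toy data are hypothetical and do not come from the paper. Lemmatization, Lesk, and WordNet lookups are omitted.

```python
from collections import Counter
from math import sqrt
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def profile(texts, max_n=3):
    """Count unigrams, bigrams, and trigrams over a collection of emails."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def likely_sender(past_emails, incoming):
    """Score each known sender against the incoming email.

    past_emails maps a sender name to a list of that sender's earlier emails.
    Assumption: the best cosine score stands in for the trained classifier.
    """
    target = profile([incoming])
    scores = {s: cosine(profile(texts), target) for s, texts in past_emails.items()}
    return max(scores, key=scores.get), scores


# Toy data, invented for illustration only (not taken from the Enron dataset).
past = {
    "alice": ["please find the quarterly report attached",
              "the quarterly report is attached for review"],
    "bob": ["hey wanna grab lunch today",
            "lunch today sounds great hey"],
}
sender, scores = likely_sender(past, "please review the quarterly report attached")
print(sender, scores)
```

In a full system, the per-sender profiles would more plausibly feed a trained classifier, and a claimed sender whose profile scores poorly against the message could be flagged as suspicious.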